Wine tasting can range from a casual pastime to a lucrative profession. For professional sommeliers, considerable time and training is required to adequately rate wine quality. Intuitively, we expect expert ratings to reflect the underlying chemical composition of the wines. To this end, we hypothesized it would be possible to predict ratings using statistical models based on the chemical makeup of wines.
To determine how accurately expert wine quality ratings can be predicted using a set of easily measured chemical components.
Two datasets of expert quality ratings of red and white Vinho Verde wines were used. The data is obtained from http://archive.ics.uci.edu/ml/datasets/wine+quality. There are total of 1599 red wines and 4898 white wines in the two datasets.
The outcome variable is wine quality. This variable is an ordinal variable theoretically ranging from 0-10. However, the observed ratings only range from 3-9, where 0 indicates poor quality and 10 is for excellent quality. The data is highly unbalanced across quality classes.
Predictor variables are 11 physiochemical wine components: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.
Below are the boxplots showing how each predictors are distributed across quality for red and white wine datasets.
Below plots show the correlation among the 11 variables for each red and white wine datasets, respectively. Here, the darker the blue, the more positively the variables are correlated and the darker the red, the more negatively the variables are correlated.
For the analysis, the red and white wine datasets were split into training and testing sets. The training data was sampled as 80% of the total available data and the remaining 20% was used as a testing data. We had training and test datasets for each red and white wine. The training data was used to train the model and the test data was used to evaluate the prediction accuracy of the model. Since the quality variable was highly unbalanced in the full dataset, the relative frequencies of this variable were preserved in both training and test data.
We modeled wine quality ratings using the random forest machine learning method and three likelihood based models: linear, partial proportional odds, and multinomial regression. The details of this models are given below:
Since the observed wine ratings were integers, it was unclear whether rating should be considered a continuous or ordinal random variable. As an initial step, we assumed the ratings were continuous and fit an OLS linear regression model as follows:
\[Z = X^T \beta + \epsilon,\ \text{where}\ \epsilon\sim N(0,\sigma^2 I)\] where \(Z\) is the response variable, quality, \(X\) is the design matrix with predictor variables, \(\beta\) are the regression coefficients, and we assume the response follows normal distribution with the distribution assumption for \(\epsilon\). It is well known that the least squares estimate of \(\beta\) is \[\hat\beta = (X^T X)^{-1}X^T Z.\] Since the predicted value of a linear model is continuous but the quality ratings were integers, we rounded the predicted linear regression values to the nearest integer to obtain the final predicted quality rating.
To account for the ordinal nature of wine ratings without assuming continuity, we implemented three cumulative logit models: proportional odds, partial proportional odds, and non-proportional odds. Letting \(Z\) correspond to level of wine rating , \(i\) correspond to a specific wine, and \(x_i\) correspond to the chemical components of wine i, the models are given below:
Where \(\alpha_j > \alpha_i\) for \(j > i\), and \(x^T_{i*}\) and \(x^T_{i**}\) are submatrices of \(x_i\).
The log likelihood for each model was maximized using the BFGS algorithm as implemented in the R package optimx (alternatively, users of our R package could specify any available method implemented in optimx). For models that converged, the predicted classification was taken to be the rating with the highest predicted probability. An S3 predict method was implemented in sommelieR to facilitate these predictions.
We were also interested in testing if we could relax the ordinality assumption of our data and still retain predictive accuracy. Therefore, we decided to fit a multinomial regression model to our data as well. Consider a random variable, Y, which takes on values in 1,…,K classes, with covariates X. The multinomial regression model has the form,
\[\log \frac{P(Y = k | X = x)}{P(Y = K | X = x)} = X\beta_k,\ \ k = 1,...,(K-1)\]
where \(\beta_k\) is a \(p \times 1\) vector of regression coefficient corresponding to level, k. Additionally, we apply the constraint that the probabilities sum to one, i.e. \(\sum_{i = 1}^K P(Y = i | X = x) = 1\). To get the probabilties we used for preduction, we solve the above system of equations given the above constraint:
\[P(Y = k | X = x) = \frac{\exp(X\beta_k)}{1 + \sum_{l = 1}^{K - 1}\exp(X\beta_l)},k = 1,..., K-1 \\ P(Y = K | X = x) = \frac{1}{1 + \sum_{l = 1}^{K - 1}\exp(X\beta_l)}\]
For both red and white wine we fit three multinomial regression models:
where the coefficients in the reduced model - and how they were selected - are mentioned below. While we would have liked to fit models with higher order terms (quadratic terms, interactions), with our current implementation of multinomial regression - this was impractical. For a response variable taking on K classes, each new variable added to the model adds (K-1) coefficients, which made fitting a full second order model challenging without binning our response variable. In the future, this challenge could be overcome through programming in C++ or using a penalization for each row of the coefficent matrix (since you could not simply penalize each individual coefficient).
For the random forest models, we used all 11 physiochemical variables as predictors. For the likelihood based models, we considered full and reduced models. The full models for both red and white wines included all 11 physiochemical variables. The variables in the reduced models were determined by looking at the correlations between predictors and using best subsets based on OLS regression. The covariates selected for the red model were volatile acidity, total sulfur dioxide, pH, alcohol, and sulphates. The covariates selected for the white model were pH, volatile_acidity, residual_sugar, and alcohol.
We compared the four models on overall accuracy (correct predictions/total), kappa, and weighted kappa. Kappa is a commonly used statistic for capturing how well the classification is done compared to a 50% random chance classification. Weighted kappa is an extension of kappa that is a more useful for data with inherent ordering since it penalizes misclassifications proportional to the distance from the true category. For instance, when the true quality rating is 4, a prediction of 7 will be penalized more severely than a prediction of 5. Since our outcome variable, quality, is ordered, we used the weighted Kappa to select the final model.
The predictive performance of the full and reduced linear regression models on the testing data are given below:
| Accuracy | Kappa | Weighted Kappa | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Full (Red) | 57.73 | 0.2998 | 0.4996 | 0 | 0 | 67.65 | 64.57 | 23.08 | 0 | NA |
| Reduced (Red) | 56.15 | 0.2728 | 0.4545 | 0 | 0 | 65.44 | 63.78 | 20.51 | 0 | NA |
| Full (White) | 52.61 | 0.2162 | 0.4211 | 0 | 0 | 39.86 | 81.78 | 22.16 | 0 | 0 |
| Reduced (White) | 51.02 | 0.1975 | 0.3943 | 0 | 0 | 43.99 | 77.22 | 18.18 | 0 | 0 |
Below is a visualization of the confusion matrix for red wine. We can see that the linear model tends to predict most of the quality towards the mean value since the model captures the population mean and also the data is highly unbalanced with most of the data concentrated around the 5 and 6 quality ratings.
This concentration towards the mean trend is more distinct with the white wine, where most of the prediction is 6.
For models fit on the white wine training data, only proportional odds models converged to parameter estimates that produced non-negative fitted probabilities for each subject and quality rating. The code for fitting these models is given below (note that the models objects are saved to shorten the compile time for this report):
#get starting values for model beta's
beta.starts <- coef(lm(quality ~ alcohol+ pH + volatile.acidity + residual.sugar,
data = white_train))
#proportional odds, reduced model
white.prop.odds <- partial.prop.odds.mod(y ="quality", in.data = white_train,
prop.odds.formula = ~ alcohol+ pH + volatile.acidity + residual.sugar,
beta.prop.odds.start = beta.starts[c(2:5)],
method = "BFGS")
saveRDS(white.prop.odds, "proportional_odds_models/white.prop.odds.reduced.rds")
#proportional odds, full model
beta.starts <- coef(lm(quality ~ alcohol+ pH + volatile.acidity + residual.sugar + fixed.acidity + citric.acid +
chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + sulphates,
data = white_train))
white.prop.odds <- partial.prop.odds.mod(y ="quality", in.data = white_train,
prop.odds.formula = ~ alcohol+ pH + volatile.acidity + residual.sugar +
fixed.acidity + citric.acid + chlorides + free.sulfur.dioxide +
total.sulfur.dioxide + density + sulphates,
beta.prop.odds.start = beta.starts[c(2:12)],
method = "BFGS")
saveRDS(white.prop.odds, "proportional_odds_models/white.prop.odds.rds")
Then using the predict method implemented for this class, we predicted the ratings in the testing data as follows:
#load saved models
white.prop.odds.reduced <- readRDS("proportional_odds_models/white.prop.odds.reduced.rds")
white.prop.odds <- readRDS("proportional_odds_models/white.prop.odds.rds")
#predict values
white.preds <- predict(white.prop.odds, white_test)$most.likely
white.preds.reduced <- predict(white.prop.odds.reduced, white_test)$most.likely
The performance for the full and reduced models are given below. Notably, the full model performed better than the reduced, but only very slightly.
| Prediction Accuracy | Kappa | Weighted Kappa | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Proportional Odds (F) | 52.1472 | 0.2126 | 0.4053 | 0 | 0 | 45.3608 | 78.1321 | 19.8864 | 0 | 0 |
| Proportional Odds (R) | 51.7382 | 0.2108 | 0.3993 | 0 | 0 | 51.2027 | 74.9431 | 15.9091 | 0 | 0 |
Likewise, the confusion matrix for the full model, using the plotting functionality implemented in our package, is given below:
#plotting confusion matrix
cmPlot(white.pred.table, "white", pred_first = T, title = "Proportional Odds Model Predicted vs Observed", axis.title.size = 20, axis.text.size = 25)
For models fit to the red training data, both proportional and partial proportional models converged. The results presented for the partial proportional model allow the coefficient for total sulfur dioxide to vary with the level of wine quality. The code for producing these models is given below:
#reduced red wine models
beta.starts <- coef(lm(quality ~ alcohol+ pH + volatile.acidity + sulphates + total.sulfur.dioxide,
data = red_train))
red.partial.prop.reduced <- partial.prop.odds.mod(y ="quality", in.data = red_train,
prop.odds.formula = ~ alcohol + pH+ volatile.acidity + sulphates,
beta.prop.odds.start = beta.starts[2:5],
non.prop.odds.formula = ~total.sulfur.dioxide,
beta.non.prop.odds.start = matrix(rep(beta.starts[6], 5), nrow = 1),
method = "BFGS")
red.prop.odds.reduced <- partial.prop.odds.mod(y ="quality", in.data = red_train,
prop.odds.formula = ~ alcohol + pH + volatile.acidity + sulphates
+ total.sulfur.dioxide,
beta.prop.odds.start = beta.starts[2:6],
method = "BFGS")
saveRDS(red.partial.prop.reduced, "proportional_odds_models/red.partial.prop.reduced.rds")
saveRDS(red.prop.odds.reduced, "proportional_odds_models/red.prop.odds.reduced.rds")
#full models
beta.starts <- coef(lm(quality ~ alcohol+ pH + volatile.acidity + residual.sugar + fixed.acidity + citric.acid +
chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + sulphates,
data = red_train))
#partial proportional odds model
red.partial.prop <- partial.prop.odds.mod(y ="quality", in.data = red_train,
prop.odds.formula = ~ alcohol+ pH + volatile.acidity + residual.sugar +
fixed.acidity + citric.acid + chlorides + free.sulfur.dioxide +
density + sulphates,
beta.prop.odds.start = beta.starts[c(2:9, 11:12)],
non.prop.odds.formula = ~total.sulfur.dioxide,
beta.non.prop.odds.start = matrix(rep(beta.starts[10], 5), nrow = 1),
method = "BFGS")
#proportional odds model
red.prop.odds <- partial.prop.odds.mod(y ="quality", in.data = red_train,
prop.odds.formula = ~ alcohol+ pH + volatile.acidity + residual.sugar +
fixed.acidity + citric.acid + chlorides + free.sulfur.dioxide +
total.sulfur.dioxide + density + sulphates,
beta.prop.odds.start = beta.starts[2:12],
method = "BFGS")
saveRDS(red.partial.prop, "proportional_odds_models/red.partial.prop.rds")
saveRDS(red.prop.odds, "proportional_odds_models/red.prop.odds.rds")
Predictions on the testing data were again made using our S3 predict method (code as shown for white wines). The performance of the trained models on the testing data are given below. Briefly, the proportional and partial proportional odds models performed similarly, and the full models performed very slightly better than the reduced models.
| Prediction Accuracy | Kappa | Weighted Kappa | 3 | 4 | 5 | 6 | 7 | 8 | |
|---|---|---|---|---|---|---|---|---|---|
| Proportional Odds (F) | 58.9905 | 0.3232 | 0.5258 | 0 | 0 | 72.0588 | 59.8425 | 33.3333 | 0 |
| Proportional Odds (R) | 58.0442 | 0.3065 | 0.4707 | 0 | 0 | 72.0588 | 59.8425 | 25.6410 | 0 |
| Partial Proportional Odds (F) | 58.9905 | 0.3233 | 0.5162 | 0 | 0 | 72.0588 | 60.6299 | 30.7692 | 0 |
| Partial Proportional Odds (R) | 58.3596 | 0.3117 | 0.4742 | 0 | 0 | 72.0588 | 59.8425 | 28.2051 | 0 |
As with the white wines, the confusion matrices for the full proportional and partial proportional red models are shown below:
| Model | Accuracy | Kappa | Weighted Kappa | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Full Model (Linear Terms) | 58.3596 | 0.3158 | 0.5227 | 0 | 20 | 74.2647 | 55.9055 | 28.2051 | 0 |
| Reduced Model (Linear Terms) | 58.3596 | 0.3193 | 0.4952 | 0 | 10 | 72.7941 | 56.6929 | 33.3333 | 0 |
| Reduced Model (Second Order Terms) | 55.2050 | 0.2702 | 0.4789 | 0 | 0 | 67.6471 | 55.1181 | 33.3333 | 0 |
| Model | Accuracy | Kappa | Weighted Kappa | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Full Model (Linear Terms) | 54.0900 | 0.2451 | 0.4121 | 0 | 9.375 | 51.2027 | 79.4989 | 15.9091 | 0 | 0 |
| Reduced Model (Linear Terms) | 51.9427 | 0.2201 | 0.4093 | 0 | 3.125 | 54.6392 | 72.6651 | 16.4773 | 0 | 0 |
| Reduced Model (Second Order Terms) | 53.2720 | 0.2396 | 0.4010 | 0 | 3.125 | 54.2955 | 74.4875 | 19.8864 | 0 | 0 |
| Prediction Accuracy | Kappa | Weighted Kappa | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Red Wine | 70.98 | 0.5263 | 0.6168 | 0 | 0 | 83.82 | 70.87 | 53.85 | 0.00 | NA |
| White Wine | 67.28 | 0.4862 | 0.6542 | 0 | 25 | 67.35 | 80.87 | 47.73 | 42.86 | 0 |
Comparisons of the performance of the different modeling approaches are given below:
| Model | Prediction Accuracy | Kappa | Weighted Kappa | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Random Forest | 70.9800 | 0.5263 | 0.6168 | 0 | 0 | 83.8200 | 70.8700 | 53.8500 | 0 |
| Proportional Odds | 58.9905 | 0.3232 | 0.5258 | 0 | 0 | 72.0588 | 59.8425 | 33.3333 | 0 |
| Multinomial | 58.3596 | 0.3158 | 0.5227 | 0 | 20 | 74.2647 | 55.9055 | 28.2051 | 0 |
| Partial Proportional Odds | 58.9905 | 0.3233 | 0.5162 | 0 | 0 | 72.0588 | 60.6299 | 30.7692 | 0 |
| Linear Regression | 57.7300 | 0.2998 | 0.4996 | 0 | 0 | 67.6500 | 64.5700 | 23.0800 | 0 |
| Model | Prediction Accuracy | Kappa | Weighted Kappa | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Random Forest | 67.2800 | 0.4862 | 0.6542 | 0 | 25.000 | 67.3500 | 80.8700 | 47.7300 | 42.86 | 0 |
| Linear Regression | 52.6100 | 0.2162 | 0.4211 | 0 | 0.000 | 39.8600 | 81.7800 | 22.1600 | 0.00 | 0 |
| Multinomial | 54.0900 | 0.2451 | 0.4121 | 0 | 9.375 | 51.2027 | 79.4989 | 15.9091 | 0.00 | 0 |
| Proportional Odds | 52.1472 | 0.2126 | 0.4053 | 0 | 0.000 | 45.3608 | 78.1321 | 19.8864 | 0.00 | 0 |
Using random forests, we predicted expert wine quality ratings fairly accurately. For white wines, the overall accuracy was 67.28%, with a weighted kappa of 0.6542. For red wines, the accuracy was 70.98% with weighted kappa 0.6168. Using the proposed cutoffs from Landis and Koch, these weighted kappa values suggest moderate to substantial agreement with the expert ratings.
The likelihood approaches performed more poorly. Accuracies were 10-15% lower than random forest and weighted kappas were 0.1 to 0.2 lower. Moreover, among the likelihood based models, predictive performance was similar and no individual approach stood out as superior. Generally, the models showed a tendency to predict ratings towards the middle of the distribution, and, interestingly, accounting for the ordered nature of the ratings did not make a substantial difference in performance. Indeed, the multinomial was more accurate than the proportional odds and linear regression models for very low and high quality wines. Overall, the weighted kappas for the likelihood based models suggested moderate agreement with expert ratings.
Our project has limitations. We only had data on a small number of chemical components, so we speculate some misclassification was due to failure to include other important chemicals in our models. Likewise, our datasets included only small numbers of extremely poor and very excellent wines making these difficult to predict. In the future, it would be worthwhile to examine whether various outlier detection algorithms are more suited to characterizing these wines than the approaches we took.
In terms of variable selection for the likelihood based models, we saw only a minor improvement in predictive performance for the full compared to reduced models suggesting we may have overfit the data. A regularization method like LASSO or elastic net with cross-validation may have been useful. Additionally, our modeling focus was purely on prediction, not on characterizing how individual chemicals relate to wine quality. An interesting extension of this project would more thoroughly investigate those relationships.
In summary, we found wine quality ratings can be predicted reasonably well based on their chemical components. Random forests performed better than likelihood based models, but all methods demonstrated at least moderate agreement with expert ratings. Nevertheless, the predictions were far from perfect, so true wine connoisseurs are still better off consulting a sommelier.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159-74.